Abstract

We integrate automatic speech recognition (ASR) and 
question answering (QA) to realize a speech-driven QA 
system, and evaluate its performance. We adapt an N-gram
language model to natural language questions, so that the
input of our system can be recognized with high accuracy.
We target WH-questions, which consist of a topic part and
a fixed phrase used to ask about something.
We first produce a general N-gram model intended to rec- 
ognize the topic and emphasize the counts of the N-grams 
that correspond to the fixed phrases. Given a transcription 
by the ASR engine, the QA engine extracts the answer 
candidates from target documents. We propose a passage 
retrieval method robust against recognition errors in the 
transcription. We use the QA test collection produced in 
NTCIR, which is a TREC-style evaluation workshop, and 
show the effectiveness of our method by means of exper- 
iments. 

1. Introduction 

Question Answering (QA) was first evaluated extensively 
at TREC-8 [9]. The goal in the QA task is to extract
words or phrases as the answer to a question, rather than 
the document lists obtained by traditional information re- 
trieval (IR) systems. Speech interfaces have promise for 
improving the utility of QA systems, in which natural 
language questions are used as inputs. We enhanced our 
speech-driven IR system [5] to accept spoken questions.

In this paper, we evaluate the effects of language 
modeling on speech-driven question answering. In the past
literature, language models were evaluated independently
of specific tasks. Perplexity is one of the common mea- 
sures to evaluate language models, irrespective of the 
speech recognition accuracy. Word error rate (WER) is 
another common measure, which directly evaluates the 
accuracy of speech recognition. However, it is not clear
that these measures can evaluate the performance of specific
information processing systems that use speech interfaces.
Because question answering is a well-defined task that has
been evaluated in formal evaluation workshops, e.g., TREC
and NTCIR, we can evaluate components of a system, in
particular language modeling, through a rigorous method.

Section 2 describes our language modeling method
for speech-driven question answering [2]. Section 3
describes our question answering engine [3]. Section 4
describes the experimental results.

2. Language Modeling for Question 
Answering 

Question answering systems accept a question consisting 
of the part that conveys a topic and the part that represents 
a fixed phrase for question sentences. The following is an 
example question: 

seN / kyu- / hyaku / nana / ju- / roku / neN / ni
/ kasei / ni / naN / chakuriku / shita / taNsaki
/ wa / naN / to / yu- / namae / desu / ka
(What was the name of the spacecraft that 
landed safely on Mars in 1976 ?) 

The first half of the question, i.e., "seN kyu- hyaku nana
ju- roku neN ni kasei ni naN chakuriku shita taNsaki
wa (the spacecraft that landed safely on Mars in 1976)",
conveys the topic, and can be recognized by an N-gram 
model trained with target documents (e.g., newspaper ar- 
ticles). The latter half of the question, i.e., "naN to yu- 
namae desu ka (What was the name?)", is a fixed phrase 
typically used in interrogative questions, which is not 
very frequent in newspaper articles. Thus, we need a lan- 
guage model adapted to both types of expressions. 

Note that recognizing the fixed phrases with high ac- 
curacy is crucial in question answering, because these 
phrases convey clues to determine the question and an- 
swer types. For example, a fixed phrase indicates that the 
answer should be the name of an object as in the previous 
example, while another question can potentially indicate 
that the answer should be the date of an event (e.g., "On 
what date was..."). 
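
The two-part structure described above can be illustrated with a minimal Python sketch that splits a segmented question into its topic part and its fixed-phrase part. The WH-word list and the function name `split_question` are assumptions made for this illustration and are not taken from the system described here. The split is made at the last WH-word because, in the example question, the syllable "naN" also occurs inside the topic part.

```python
# Hypothetical WH-word inventory in the romanization used above (an assumption).
WH_WORDS = {"naN", "dare", "doko", "itsu", "dore", "ikutsu"}


def split_question(words, wh_words=WH_WORDS):
    """Split a word list into (topic part, fixed-phrase part).

    The fixed-phrase part starts at the LAST WH-word, so that an earlier
    token that merely looks like a WH-word stays in the topic part.
    If no WH-word is found, the whole question is treated as topic.
    """
    for i in range(len(words) - 1, -1, -1):
        if words[i] in wh_words:
            return words[:i], words[i:]
    return words, []


question = (
    "seN kyu- hyaku nana ju- roku neN ni kasei ni naN chakuriku "
    "shita taNsaki wa naN to yu- namae desu ka"
).split()

topic, fixed_phrase = split_question(question)
print(" ".join(topic))
print(" ".join(fixed_phrase))
```

In this sketch the fixed-phrase part is what an answer-type classifier would inspect, while the topic part is what the retrieval step would use.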

In this paper, we use our previous method [2], in
which a language model for question answering is produced
from a list of the fixed phrases typically used in
interrogative questions. This method emphasizes the N-gram
subset corresponding to the fixed phrases. This method can
be recast as a variant of maximum a posteriori probability
(MAP) estimation, in which the N-gram subset of a
background corpus is used as a posterior distribution.

2.1. Language Modeling by Emphasizing N-gram 
Subsets 

Let $S$ be a set of sentences. Let $S_{fp}$ be a subset of $S$ that
consists of the sentences including the fixed phrases in a
list. Let $P$ be a language model of generating sentences
(i.e., $s \in S$) obtained from a general-purpose background
corpus. The aim of the language model adaptation for
the fixed phrases is to obtain the adapted language model
$P'$, which provides higher probability scores for sentences
$s \in S_{fp}$ but maintains the order relations on sentences
$s \in S - S_{fp}$ as much as possible.

The adapted model P' is produced by the following 
two steps. 

1. Revise the maximum likelihood estimates of $P$:

$P_{ML(1)}(w_i),\ P_{ML(2)}(w_i \mid w_{i-1}),\ \ldots,\ P_{ML(N)}(w_i \mid w_{i-N+1}^{i-1})$

which are calculated for each value of $n$ ($1 \le n \le N$).

2. Apply the back-off smoothing to integrate the
revised ML estimates $P'_{ML(n)}(w_i \mid w_{i-n+1}^{i-1})$
($1 \le n \le N$).



For each value of $n$ ($1 \le n \le N$), the maximum likelihood
estimates $P_{ML(n)}(w_i \mid w_{i-n+1}^{i-1})$ of the N-gram
probability $P$ obtained from the background corpus are revised
to $P'_{ML}$ by the following procedure.



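(1). If the postfix $w_{i-k+1} \cdots w_i$ ($1 \le k \le n$) of
the word sequence $w_{i-n+1} \cdots w_i$ is equal to the
prefix $w_p \cdots w_{p+k-1}$ of one of the fixed phrases
$w_p \cdots w_q$, then emphasize $P_{ML}$ as follows:

$P'_{ML(n)}(w_{p+k-1} \mid w_{i-n+1}^{i-1}) = \beta_n(w_{i-n+1}^{i-1}) \cdot \gamma \, P_{ML(n)}(w_{p+k-1} \mid w_{i-n+1}^{i-1})$

Otherwise, go to step (2).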



(2). If the word sequence $w_{i-n+1} \cdots w_i$ is equal to
a subsequence of one of the fixed phrases
$w_p \cdots w_q$, then emphasize only the longest
N-gram probability $P_{ML(N)}$ as follows:

$P'_{ML(N)}(w_i \mid w_{i-N+1}^{i-1}) = \beta_N(w_{i-N+1}^{i-1}) \cdot \gamma \, P_{ML(N)}(w_i \mid w_{i-N+1}^{i-1})$

Otherwise, go to step (3).
Figure 1: Emphasizing trigram counts.



(3). For all $n$ ($1 \le n \le N$), the revised probability is:

$P'_{ML(n)}(w_i \mid w_{i-n+1}^{i-1}) = \beta_n(w_{i-n+1}^{i-1}) \, P_{ML(n)}(w_i \mid w_{i-n+1}^{i-1})$



Here, $\gamma$ ($> 1$) is a multiplier that emphasizes the selected
N-grams, and $\beta_1(\epsilon), \ldots, \beta_N(w_{i-N+1}^{i-1})$ are
normalization coefficients chosen so that the probabilities
add up to one.

This can be seen as the task adaptation process by
maximum a posteriori probability (MAP) estimation [4],
in which the N-gram subset corresponding to the fixed
phrases is used as task-specific data for adaptation. That
is, $P'_{ML}$ is equivalent to the maximum likelihood
estimate calculated as follows.



$P'_{ML(n)}(w_i \mid w_{i-n+1}^{i-1}) = \dfrac{C'_n(w_{i-n+1}^{i})}{\sum_{w} C'_n(w_{i-n+1}^{i-1}\, w)}$



where the N-gram counts $C'_n$ for each value of $n$ ($1 \le
n \le N$) are obtained by emphasizing the selected subset
of the original N-gram counts $C_n$, as shown in Fig. 1.
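
The count-based view can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in our experiments: every N-gram that occurs contiguously inside a fixed phrase is emphasized, the distinction between steps (1) and (2) is ignored, and back-off smoothing is omitted. The toy corpus and all function names are hypothetical.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count all n-grams of order n in a list of tokenized sentences."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def phrase_ngrams(fixed_phrases, n):
    """Collect the n-grams that occur contiguously inside a fixed phrase."""
    selected = set()
    for phrase in fixed_phrases:
        for i in range(len(phrase) - n + 1):
            selected.add(tuple(phrase[i:i + n]))
    return selected


def emphasize(counts, selected, gamma):
    """Multiply the counts of the selected n-grams by gamma (> 1)."""
    return {g: c * gamma if g in selected else c for g, c in counts.items()}


def ml_estimate(counts, ngram):
    """Maximum likelihood estimate: count / total count of the same history."""
    history = ngram[:-1]
    total = sum(c for g, c in counts.items() if g[:-1] == history)
    return counts.get(ngram, 0) / total if total else 0.0


# Toy background corpus (hypothetical): the fixed phrase is rare in it.
corpus = [["namae", "wa", "nagai"]] * 3 + [["namae", "desu", "ka"]]
fixed_phrases = [["namae", "desu", "ka"]]

base = ngram_counts(corpus, 2)
adapted = emphasize(base, phrase_ngrams(fixed_phrases, 2), gamma=50)

p_before = ml_estimate(base, ("namae", "desu"))
p_after = ml_estimate(adapted, ("namae", "desu"))
print(p_before, p_after)
```

Because the estimate is recomputed from the emphasized counts, the probabilities that share a history still sum to one, which is the role played by the coefficients $\beta_n$ in the probability-based formulation.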

3. Question Answering Engine 

3.1. Question Answering as a Search Problem 

The question answering process is often seen as the se- 
quence of the question analysis, the relevant document 
(or passage) retrieval, answer extraction and answer se- 
lection processes. In this paper, we recast these processes 
as a search problem. 

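Given a question $q$ and documents $D$, let $S$ be the set of
the substrings of the documents, i.e., $S = \{ d[p_s, p_e] \mid
d \in D,\ p_s < p_e \}$, where $p_s$ and $p_e$ are positions in $d$.
By using an evaluation function $L(a \mid q)$ defined on $a \in S$,
search for the most appropriate answer $\hat{a}$ such that

$\hat{a} = \arg\max_{a \in S} L(a \mid q)$.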

This defines the problem of finding a single best answer, 
which corresponds to the factoid question in TREC and 
the subtask 1 of NTCIR Question Answering Challenge 
(QAC). 
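
The search formulation can be sketched as follows. This is a minimal illustration, assuming that candidates are token spans of bounded length rather than arbitrary character substrings, and that the evaluation function is supplied by the caller. The names `candidate_spans` and `best_answer`, the toy document, and the toy scores are all hypothetical.

```python
def candidate_spans(doc_tokens, max_len):
    """Enumerate all token spans of length 1..max_len in one document."""
    for start in range(len(doc_tokens)):
        last = min(start + max_len, len(doc_tokens))
        for end in range(start + 1, last + 1):
            yield tuple(doc_tokens[start:end])


def best_answer(question, documents, score, max_len=3):
    """Return the span a that maximizes score(a, question) over all documents."""
    candidates = [
        span for doc in documents for span in candidate_spans(doc, max_len)
    ]
    return max(candidates, key=lambda a: score(a, question))


# Toy evaluation function (an assumption): a lookup table standing in for L(a|q).
toy_scores = {("Viking", "1"): 0.9, ("Mars",): 0.4}


def toy_score(answer, question):
    return toy_scores.get(answer, 0.0)


documents = [["Viking", "1", "landed", "on", "Mars"]]
answer = best_answer("name of the spacecraft", documents, toy_score)
print(answer)
```

Exhaustive enumeration is only practical for small inputs; the sketch is meant to make the argmax explicit, not to suggest an efficient search strategy.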

3.2. Passage Retrieval 

The evaluation function $L$ is constructed from various
aspects. One of them is the similarity between the question
and the context of an answer candidate. Selecting
the context, or passage retrieval, is one of the common
research topics for question answering [8].

Figure 2: WERs on spoken questions (BASE: baseline method, EMP: our method).

Because by definition speech-driven question answer- 
ing accepts a result of speech recognition as an input, 
which often includes errors, the passage retrieval must be 
robust against those errors. We propose a dynamic pas- 
sage retrieval method that can accept an input including 
misrecognized words. 

Suppose, from a given query $q$, we select the context of
an answer candidate $a$, which belongs to sentence $s_i$ of
document $d = s_1 s_2 \cdots s_i \cdots s_n$. Let $s'_i = s_i - \{a\}$, $h$ be
the headline of $d$, and $t$ be the string "Kotoshi Kongetsu
Kyou" (this year, this month, today). Given a number $k >
0$, let $U_i = \{h, t, s_{i-k}, \ldots, s_{i-1}, s'_i, s_{i+1}, \ldots, s_{i+k}\}$.

The optimal context $\hat{C}_i$ is selected from $C_i \in 2^{U_i}$ by
maximizing the following evaluation measure $F(C_i)$.



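$F(C_i) = \dfrac{1 + \beta^2}{\dfrac{\beta^2}{\mathrm{Score}(q \wedge C_i) / \mathrm{Score}(q)} + \dfrac{1}{\mathrm{Score}(q \wedge C_i) / \mathrm{Score}(C_i)}}$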



Here, Score($A$) is the sum of the IDFs (inverse document
frequencies) of the elements in $A$, and Score($A \wedge B$) is the
sum of the IDFs of the elements that appear in both
$A$ and $B$.

We used $k = 1$ for our experiments. The measure $F$
corresponds to the (weighted) F-measure often used in IR
research. The recall is more influential than the precision
in calculating the F-measure if the value of $\beta$ is more
than one. Because the recall is important for selecting
answer candidates, we set $\beta = 2$.
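
The context selection can be sketched in Python as an exhaustive search over subsets of the candidate units. This is a minimal illustration under stated assumptions: each unit (sentence, headline, or date string) is represented as a set of terms, the IDF values are toy numbers, and the function names are hypothetical. With a small window such as $k = 1$, the number of subsets stays small enough to enumerate.

```python
from itertools import combinations


def score(terms, idf):
    """Sum of the IDF values of a set of terms."""
    return sum(idf.get(t, 0.0) for t in terms)


def f_measure(query_terms, context_terms, idf, beta=2.0):
    """Weighted F-measure between the query and a candidate context."""
    common = score(query_terms & context_terms, idf)
    if common == 0.0:
        return 0.0
    recall = common / score(query_terms, idf)
    precision = common / score(context_terms, idf)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


def best_context(query_terms, units, idf, beta=2.0):
    """Try every non-empty subset of units and keep the one with the highest F."""
    best, best_f = (), 0.0
    for r in range(1, len(units) + 1):
        for subset in combinations(range(len(units)), r):
            terms = set().union(*(units[i] for i in subset))
            f = f_measure(query_terms, terms, idf, beta)
            if f > best_f:
                best, best_f = subset, f
    return best, best_f


# Toy IDF table and units (hypothetical values).
idf = {"mars": 2.0, "spacecraft": 3.0, "landed": 1.0,
       "1976": 2.5, "weather": 2.0, "forecast": 1.5}
query = {"mars", "spacecraft", "landed", "1976"}
units = [
    {"forecast"},                       # previous sentence
    {"spacecraft", "landed", "mars"},   # sentence of the candidate, answer removed
    {"1976", "weather"},                # next sentence
]

selected, f_value = best_context(query, units, idf)
print(selected, round(f_value, 4))
```

In this toy case the next sentence is added to the context even though it contains an unrelated term, because with $\beta = 2$ the gain in recall outweighs the loss in precision. A misrecognized query word simply fails to match and lowers the recall term, without excluding the passage outright.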



4. Evaluation 

4.1. NTCIR Question Answering Challenge 

The test collection constructed in the first evaluation of 
Question Answering Challenge (QAC-1) [6], which was
carried out as a task of NTCIR Workshop 3, was used 
as the test data for our evaluation. The task definition of 
QAC-1 is as follows. 

Target documents are two years of Japanese newspa- 
per articles, from which the answers of a given question 
must be extracted. The answer is a noun or a noun phrase, 
e.g., person names, organization names, names of various 
artifacts, money, size and date. Three subtasks were
performed in QAC-1, among which subtask 1 is defined
as follows.

System extracts at most five answers from 
the documents for each question. The recip- 
rocal number of the rank is the score for the 
question. For example, if the second answer 
candidate is correct, the score is 0.5. 

This definition is almost equivalent to factoid question
answering in TREC. A total of 200 questions were used for
the formal evaluation, of which four had no answer in the
target documents. The Mean Reciprocal Rank (MRR) of the
remaining 196 questions was used to evaluate the
performance of participant systems.
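
The scoring rule of subtask 1 can be written down as a short Python sketch. The function names and the toy runs below are hypothetical, and the sketch assumes exact string matching between system answers and correct answers, which is a simplification of the actual assessment.

```python
def reciprocal_rank(ranked_answers, correct_answers, max_rank=5):
    """Return 1/rank of the first correct answer within the top max_rank, else 0."""
    for rank, answer in enumerate(ranked_answers[:max_rank], start=1):
        if answer in correct_answers:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal ranks over (ranked_answers, correct_answers) pairs."""
    return sum(reciprocal_rank(r, c) for r, c in runs) / len(runs)


# Toy runs: correct at rank 2, correct at rank 1, and no correct answer.
runs = [
    (["a", "b"], {"b"}),
    (["x"], {"x"}),
    (["p", "q"], {"z"}),
]
mrr = mean_reciprocal_rank(runs)
print(mrr)
```

A question whose correct answer is not among the five returned candidates contributes zero, so the MRR penalizes both low ranks and missing answers.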

4.2. Experimental Results 

The effects of language modeling on question answering
were experimentally investigated. We extracted N-gram
counts from newspaper articles in 111 months. The
vocabulary size was 60,000. We produced a word network
for the Japanese fixed phrases used for question
sentences. From the network, we extracted the 172 fixed
phrases accepted by the network.

We used the N-gram model produced only from the
newspaper articles as the baseline (BASE). For our
proposed method, we emphasized the N-gram counts
corresponding to the fixed phrases, and produced the adapted
model (EMP). The magnification parameter $\gamma$ was set to
50, which had been determined by our previous
experiments [2].

All of the 200 questions in the QAC-1 test collection
were used for our experiments. We produced our spoken
question data set. The questions were read by four
females (F001, F002, F003 and F004) and four males
(M001, M002, M003 and M004). An existing LVCSR
system [7] was used for the purpose of transcription.

The WERs of the results of speech recognition are
shown in Figure 2. BH denotes WERs for an entire
sentence, while FH and LH denote WERs for the first and
latter halves of a sentence, respectively. We divided each
sentence into the first and latter halves by using Japanese
WH-words as the boundary (the latter half must include
the WH-word), and investigated the WERs of both halves
independently. Note that the latter halves roughly
correspond to the fixed phrases used in interrogative questions.
Figure 2 suggests that the proposed method (EMP)
significantly decreased the WER for the fixed phrases (LH),
while it did not decrease the WER for the other parts of
the input sentences (FH).

Figure 3 shows the result of question answering using
both text inputs, which correspond to the speech inputs
with no error, and speech inputs by the eight speakers.
For speech inputs, the best hypothesis from the LVCSR
system was used as the input of the question answering
engine. The result shows that the speech input decreased
the performance almost by half. However, when using
the proposed method of language modeling (EMP), the
MRR was increased by 0.03 points on average.

Figure 3: QA performance by spoken questions.

We used the paired t-test for statistical testing, which
investigates whether the difference in performance is
meaningful or simply due to chance. We found that the
MRR values for BASE and EMP were significantly
different (at the 5% level).

5. Conclusion

In this paper, we proposed a speech-driven question
answering (QA) system and evaluated its performance,
focusing mainly on the effects of language modeling. For
evaluation purposes, we used the test questions in the
NTCIR collection, read by eight human subjects. The
experimental results showed that our language modeling
method improved the accuracy of recognizing spoken
questions and consequently the accuracy of question
answering. At the same time, when compared with
text-based QA, the performance of the speech-driven QA system
was not satisfactory from a practical point of view. Future
work includes improving each module through a glass-box
error analysis and extending our system to
spontaneously spoken questions [1].